We are analyzing the quality of wine based on 11 predictors: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol. Quality, in this instance, is defined as an indicator of its craftsmanship and, thus, desirability which can be used, for example, in pricing. Quality does not necessarily indicate if a wine has gone bad. We will use two data sets – one for white wine and the other for red – to create two respective machine learning models. There are 4899 observations in the white wine data set and 1600 observations in the red wine data set.
We are interested in conducting analysis on both the white and red wine to assess the quality of wine in new wines in production so that they may be accurately priced and marketed.
This project uses data on the white and red wine, which records information of the chemical makeup of the wine.
white_og <- read.csv("Wine Dataset/winequality-white.csv", sep = ";")
red_og <- read.csv("Wine Dataset/winequality-red.csv", sep = ";")To clean our data, we clean the column names, change quality into a factor so we can analyze it with classification models, and add a type of White or Red to each data set. We removed any rows in the white wine data set with a quality value of 9 because there are too few instances, and thus, it inhibits our models from performing correctly later on. The red wine data set did not have any quality values of 9. We also create a combined data frame with both values from white and red wine to see if there are significant differences between red and white evaluations for quality.
In our data split, we put a proportion of .7 of each original data set into a training data set and a proportion of .3 into the testing data sets, stratifying by quality. In this section, we also folded our data into 5 folds for later cross validation use.
set.seed(1234)
white_split <- white %>%
initial_split(prop = 0.7, strata = "quality")
white_train <- training(white_split)
white_test <- testing(white_split)
red_split <- red %>%
initial_split(prop = 0.7, strata = "quality")
red_train <- training(red_split)
red_test <- testing(red_split)
white_fold <- vfold_cv(white_train, v = 5)
red_fold <- vfold_cv(red_train, v = 5)What sort of factors do winemakers and sommeliers look for in a quality wine? Generally, quality is determined by acidity, dryness, flavor profile or taste, alcohol content, and how well the wine is preserved or how it changes as it is stored. In our exploratory data analysis, we will analyze our predictors based on these five categories. First, acidity levels can be summarized through the ph levels, fixed.acidity, volatile.acidity, and citric.acid content. Dryness is determined by the density. Taste can be broken down into sweetness and saltiness, which are caused by residual.sugar and chlorides respectively. We will analyze alcohol content singularly to see its effect on the wine quality. Lastly, sulfurous compounds are what is generally used to preserve wine, so we will analyze free.sulfur.dioxide, total.sulfur.dioxide, and sulphates to see if the way a wine is preserved interacts with wine quality in an interesting way.
Our data can be split into two data sets because experts look for different levels of acidity, sugar, etc. for white wine and red wine. Thus, we will have 3 different representations of the data: one for white wine, one for red wine, and one for both.
All of our predictors are continuous, so we will use box plots, histograms, and scatter plots to visualize our data and determine feature selection.
First, let’s see the distribution of quality between both data sets of wine.
ggplot(combinedWine_og, aes(quality)) + geom_bar(color = "black", fill = "pink") + labs(title = "Histogram of Quality - Total Wine") + xlab("Quality of Wine") + ylab("Count") We can see that it is normally distributed, meaning that most wine has a quality value of 5 or 6, with few exceptionally good wines having a value of 8 or 9, and low quality wines having a quality value of 3. Based on their low frequency, we can further justify selecting against of quality values of 9 in our initial data cleaning.
Next, we look at the correlation matrices for white and red wine separately to determine which predictors are correlated.
white %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'lower', diag = FALSE,
method = 'color', main = 'White Wine Correlation Plot')red %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'lower', diag = FALSE,
method = 'color', main = 'Red Wine Correlation Plot')In the white wine correlation matrix, density and residual sugar; and density and alcohol are the predictors with the highest correlation. Total sulfur dioxide and free sulfur dioxide also have a moderate correlation.
In the red wine correlation matrix, citric acid and fixed acidity; density and fixed acidity; citric acid and volatile acidity; pH and fixed acidity; and free sulfur dioxide and total sulfur dioxide are highly correlated with each other.
To visualize and validate these correlations, let’s take a look at the scaled scatter plot of each predictor plotted against its correlated counterpart.
# scaled data sets
scaled_white = as.data.frame(scale(select(white, c(-quality,-type))))
scaled_red = as.data.frame(scale(select(red, c(-quality,-type))))
# scaled white residual sugar versus density
ggplot(scaled_white, aes(x = residual.sugar, y = density)) + geom_point()+scale_x_continuous(name = "Residual Sugar") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle(" Residual Sugar Versus Density") + theme(plot.title = element_text(size = 20))# scaled white alcohol versus density
ggplot(scaled_white, aes(x = alcohol, y = density)) + geom_point()+scale_x_continuous(name = "Alcohol") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Alcohol Versus Density") + theme(plot.title = element_text(size = 20))# scaled white free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_white, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 20))# scaled red volatile acidity versus citric acid
ggplot(scaled_red, aes(x = volatile.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Volatile Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Volatile Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 20))# scaled red fixed acidity versus citric acid
ggplot(scaled_red, aes(x = fixed.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 20))# scaled red fixed acidity versus pH
ggplot(scaled_red, aes(x = fixed.acidity, y = pH)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "pH") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus pH") + theme(plot.title = element_text(size = 20))# scaled red fixed acidity versus density
ggplot(scaled_red, aes(x = fixed.acidity, y = density)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Density") + theme(plot.title = element_text(size = 20))# scaled red free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_red, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Dioxide Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 20))Based on the scatter plots, we can visualize the correlations between the predictors. For example, for white wine, density has a strong positive correlation with residual sugar and a moderate negative correlation with alcohol. Through these scatter plots, we confirm the existence of correlations predicted by our initial correlation matrix.
Now, we can take a look at the box plots for several of our predictors to see the ways that they interact with wine quality, isolated from the other predictors. First, we will visualize acidity levels which can be measured through fixed acidity, volatile acidity, and citric acid levels. As shown above in the scatter plots, these three predictors are highly correlated with each other in red wine.
#fixed acidity
ggplot(combinedWine, mapping = aes(x = `fixed.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Fixed Acidity Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) +coord_cartesian( xlim = c(0,16), ylim = NULL, default = FALSE )# volatile acidity
ggplot(combinedWine, mapping = aes(x = `volatile.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Volatile Acidity Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )# citric acid
ggplot(combinedWine, mapping = aes(x = `citric.acid`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Citric Acid Levels Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )From these box plots, we can see that fixed acidity levels are relatively consistent in both red and white wine. Volatile acidity has a negative correlation with quality in red wine, but relatively consistent averages for each level of wine quality in white wine. Citric acid levels in red wine have a stronger positive correlation than in white wine. In general, we can see that acidity levels fluctuate more in red wine than in white wine.
Next, let’s take a look at dryness which is determined by the predictor density. Based on the correlation matrix and scatter plots, density also is correlated with residual sugar and alcohol in white wine and with fixed acidity in red wine.
# density
ggplot(combinedWine, mapping = aes(x = density, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = density, y = quality)) + labs(title = "Red and White Density Levels versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(.985,1.01), ylim = NULL, default = FALSE )Although density is correlated with several predictors according to the correlation matrices and scatter plots, in this box plot, we can see that density stays relatively consistent, around 1, for each level of wine quality.
Next, we will look at the taste of the wine, which is determined by levels of sweetness and saltiness. These are affected by sugar levels and chlorides respectively.
#sugar content
ggplot(combinedWine, mapping = aes(x = residual.sugar, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = residual.sugar, y = quality)) + labs(title = "Red and White Residual Sugar Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,25), ylim = NULL, default = FALSE )# we removed outliers to ensure that the variation was not due to the outliers
ggplot(red[red$free.sulfur.dioxide < 50,], aes(x = free.sulfur.dioxide, y = quality)) +
geom_boxplot(aes(fill = quality)) +
labs(title = "Free Sulfur Dioxide for Red Wine", x = "Free Sulfur Dioxide", y = "Quality") +
geom_point(width = 0.15) +
scale_fill_brewer(palette = "RdPu")#chlorides
ggplot(combinedWine, mapping = aes(x = chlorides, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = chlorides, y = quality)) + labs(title = "Red and White Chloride Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,0.3), ylim = NULL, default = FALSE )White wine has, on average, higher and more variable sugar levels than red wine while red wine has an on average higher chloride content than white wine. There seem to be a higher number of outliers in the values of chloride.
Next, we will analyze alcohol content, which can affect the taste of the wine as well.
ggplot(combinedWine, mapping = aes(x = alcohol, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Alcohol Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) There is generally higher alcohol content associated with a wine of higher quality but there is not a significant different in averages between red wine and white wine.
Lastly, let’s look at the preservative content, which is determined by free sulfur dioxide, total sulfur dioxide, and sulfates.
#free sulfur dioxide
ggplot(combinedWine, mapping = aes(x = free.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = free.sulfur.dioxide, y = quality)) + labs(title = "Red and White Free Sulfur Dioxide Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,150), ylim = NULL, default = FALSE )#total sulfur dioxide
ggplot(combinedWine, mapping = aes(x = total.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = total.sulfur.dioxide, y = quality)) + labs(title = "Red and White Total Sulfur Dioxide Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,300), ylim = NULL, default = FALSE )#sulfates
ggplot(combinedWine, mapping = aes(x = sulphates, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = sulphates, y = quality)) + labs(title = "Red and White Sulfate Content versus Quality", fill = "Quality") + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1.5), ylim = NULL, default = FALSE )From the box plots we can see that red wine generally has a lower sulfur dioxide content than white wine. Also, averages across each stratification of quality have similar values except for sulfates in red wine, which have a slight positive correlation with quality.
CONCLUDING SENTENCE; CONNECT TO HOW KNOWING THESE THINGS AFFECTS OUR MODELLING
We did all of our modeling in R-Scripts for efficiency purposes, since models generally take a long time to run. We will load all of results here, and intermittently call variables throughout the report to visualize our calculations.
# LDA
load("WhiteLDA.rda")
load("RedLDA.rda")
# LDA with PCA
load("WhiteLDAPCA.rda")
load("RedLDAPCA.rda")
# Decision Tree
load("WhiteWineDecisionTree.rda")
load("RedWineDecisionTree.rda")
# Random Forest
load("WhiteWineRandomForest.rda")
load("RedWineRandomForest.rda")
# Boosted Trees
load("WhiteWineBoostedTrees.rda")
load("RedWineBoostedTrees.rda")We will be fitting linear discriminant analysis, naive Bayes, single decision tree, random forest, and boosted tree models and compare accuracy metrics. Then, we will fit the three models with the best roc_auc to our testing data. First, let’s see how the models perform on the white wine data set.
white_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())Let’s first explore linear discriminant analysis and naive Bayes classification through k-fold cross validation.
For the linear discriminant analysis model, we use a classification mode and set the engine to MASS. We then add the model and recipe to a workflow and create a fit between the workflow and folded data. We are using roc_auc to evaluate accuracy.
#set up model with mode classification and engine MASS
wlda_model <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
#add model and recipe to the workflow
wlda_wkflow<- workflow() %>%
add_model(wlda_model) %>%
add_recipe(white_recipe)
#create a fit between the workflow and folded data
wlda_fit_cross <- fit_resamples(wlda_wkflow, white_fold)
#determine the roc_auc of the LDA model on the folded training data
collect_metrics(wlda_fit_cross)## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.528 5 0.00640 Preprocessor1_Model1
## 2 roc_auc hand_till 0.735 5 0.0159 Preprocessor1_Model1
Now, let’s take a look at how our cross validation method works with a Naive Bayes model. In particular, let’s see if the accuracy increases.
#set up model with mode classification and engine kLaR
#we used set_args(use_kernel = FALSE) based on Lab 3
wnb_mod <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("klaR") %>%
set_args(usekernel = FALSE)
#add model and recipe to the workflow
wnb_wkflow <- workflow() %>%
add_model(wnb_mod) %>%
add_recipe(white_recipe)
#create a fit between the workflow and folded data
wnb_fit_cross <- fit_resamples(wnb_wkflow, white_fold)
#determine the roc_auc of the Naive Bayes model on the folded training data
collect_metrics(wnb_fit_cross)## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.451 5 0.00984 Preprocessor1_Model1
## 2 roc_auc hand_till 0.734 5 0.0200 Preprocessor1_Model1
Through k-fold cross validation, we can see that the linear discriminant analysis model produces better accuracy than the Naive Bayes model.
Although, to account for the collinearity between some of our predictors, which we saw in the exploratory data analysis, we will conduct principal component analysis on the data. Since our linear discriminant analysis model was better on the white wine data set, we will use the principal components in an LDA model.
To conduct principal component analysis, we will begin by setting up a recipe specifically for this purpose. We can now conduct an LDA workflow and model fit. We are tuning the model to find the best number of principal components using k-fold cross validation.
white_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())
# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(white_recipe_pca$template))
# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)
wlda_pca_wkflow <- workflow() %>%
add_model(wlda_model) %>%
add_recipe(white_recipe_pca)
pca_lda_fit <- wlda_pca_wkflow %>%
tune_grid(resamples = white_fold,
grid = tuneGrid,
metrics = metric_set(accuracy),
control = trControl)This visualization represents the number of principal components versus the accuracy of the model. We can observe an obvious spike in accuracy at 9 principal components.
pca_lda_metrics <- pca_lda_fit %>% collect_metrics()
ggplot(pca_lda_metrics, aes(x = num_comp, y = mean)) +
geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
scale_x_continuous(breaks = 1:ncol(white_recipe_pca$template)) +
facet_wrap(~.metric) +
theme_bw()pca_lda_fit %>% show_best(metric = "accuracy")
(bestTune <- pca_lda_fit %>% select_by_one_std_err(num_comp, metric = "accuracy"))
wlda_pca_wkflow_final <- wlda_pca_wkflow %>% finalize_workflow(bestTune)
fit_final <- wlda_pca_wkflow_final %>% fit(white_train)
white.PCALDA <- tibble(white_train,
predict(fit_final, new_data = white_train, type = "class"), # predicted class
predict(fit_final, new_data = white_train, type = "prob"), # posterior prob. for classes
as_tibble(predict(fit_final, new_data = white_train, type = "raw")$x)) # LD scoresThis is a visualization of the actually quality and the predicted qualities. We are only displaying about half of the data, so the plot is more interpretable. The plot displays the clustering of the data very well for each quality level. In addition, we can visually see how well the model predicts the qualities accurately, and around how often/how greatly the model fails.
# plot
ggplot(white.PCALDA[1:1500,], aes(x = LD1, y = LD2)) +
geom_point(aes(color = quality, shape = .pred_class)) +
theme_bw() +
ggtitle("PCA-LDA (DAPC) on White Wine Training dataset, using 9 PC")#augmented on training
pcalda_fit <- augment(fit_final, new_data = white_train)
pcalda_acc <- pcalda_fit %>% accuracy(truth = quality, estimate = .pred_class)
pcalda_rocauc <- pcalda_fit %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA ")
pcalda_roccurve <- pcalda_fit %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
pcalda_confusionmatrix <- pcalda_fit %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")These metrics and graphs reveal how accurately the PCA LDA works on the training data. The ROC-AUC curves are the best for qualities 4 and 8, although overall PCA LDA is not the most effective as the accuracy is merely 53%. Through the confusion matrix, we can see the model predicted 6s well.
pcalda_acc## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.538
pcalda_rocauc## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.760 "White Wine LDA Model using PCA "
pcalda_roccurvepcalda_confusionmatrixNow, let’s try several tree methods to see if the produce more accurate results on the training data set of white wine. We will first look at the model for a single decision tree.
First, we set up a specification with the engine rpart and for classification.
# decision tree specification
wtree_spec <- decision_tree() %>%
set_engine("rpart")
wtree_spec_class <- wtree_spec %>%
set_mode("classification")Next, we fit the specification to the training data.
wclass_tree_fit <- wtree_spec_class %>%
fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)Here is a visual of how the decision tree model works with our data.
wclass_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()Now, we augment the model on the training model and evaluate the accuracy and confusion matrix. The accuracy is 52% and the confusion matrix shows us that 5 and 6 quality are evaluated the best.
# augmented on training
augment(wclass_tree_fit, new_data = white_train) %>%
accuracy(truth = quality, estimate = .pred_class)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.529
augment(wclass_tree_fit, new_data = white_train) %>%
conf_mat(truth = quality, estimate = .pred_class)%>% autoplot(type = "heatmap")Here, we are tuning our model to determine the best measures for cost_complexity.
# tuning cost complexity
wclass_tree_wf<- workflow() %>%
add_model(wtree_spec_class %>%
set_args(cost_complexity = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)
param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)
tune_res_white <- tune_grid(
wclass_tree_wf,
resamples = white_fold,
grid = param_grid,
metric = metric_set(accuracy)
)Now we produce a graph of accuracies and roc auc levels for various cost complexity parameters.
autoplot(tune_res_white)wAutoPlot# extracting the best cost complexity parameter
best_complexity <- select_best(tune_res_white)
wclass_tree_final <- finalize_workflow(wclass_tree_wf, best_complexity)
wclass_tree_final_fit <- fit(wclass_tree_final, data = white_train)Using the measure of cost complexity which produces the best accuracy and roc auc levels, this is a visualization of the decision tree that is used in the model. It is very precise and may show signs that it would overfit on testing data.
wclass_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()# augmented on training
wdectree_pred <- augment(wclass_tree_final_fit, new_data = white_train)
wdectree_acc <- wdectree_pred %>% accuracy(truth = quality, estimate = .pred_class)
wdectree_rocauc <- wdectree_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Decision Tree Model")
wdectree_roccurve <- wdectree_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wdectree_confusionmatrix <- augment(wclass_tree_final_fit, new_data = white_train) %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")This is the overall accuracy measures, roc auc measures, roc curves and confusion matrix that is returned my our decision tree model on the white wine data set. The accuracy is on par with the PCA LDA that we previously conducted.
print(wdectree_acc)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.647
print(wdectree_rocauc)## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.818 White Decision Tree Model
wdectree_roccurvewdectree_confusionmatrixIn hopes of improving our results, we will now look at the random forest model.
# setting random forest model up
wrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")wrandfor_wf <- workflow() %>%
add_model(wrandfor %>%
set_args(mtry = tune(), trees = tune(), min_n = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)# tuning the model to find the best arguments
param_grid1 <- grid_regular(mtry(range = c(1,9)), trees(range = c(15,17)), min_n(range = c(30,50)), levels = 8)
wtune_res_randfor <- tune_grid(
wrandfor_wf,
resamples = white_fold,
grid = param_grid1,
metric = metric_set(accuracy)
)This visualization shows the accuracy and roc auc of the various values we are tuning our model with. For the final recipe, we will extract the metrics that return the best accuracy and roc auc.
autoplot(wtune_res_randfor)# collecting metrics to find best mean
wbest_rocauc1 <- collect_metrics(wtune_res_randfor) %>% arrange(desc(mean))
wbest_metric1 <- select_best(wtune_res_randfor)
wrandfor_final <- rand_forest(mtry = 2, trees = 17, min_n = 30) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
wrandfor_fit_final <- fit(wrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)# variance importance plot
wVIP <- vip(wrandfor_fit_final)
# extracting the metrics
wrandfor_pred <- augment(wrandfor_fit_final, new_data = white_train)
wrandfor_acc <- wrandfor_pred %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
wrandfor_rocauc <- wrandfor_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
wrandfor_roccurve <- wrandfor_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wrandfor_confusionmatrix <- augment(wrandfor_fit_final, new_data = white_train) %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")wVIPprint(wrandfor_acc)## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.800 White Random Forest Model
print(wrandfor_rocauc)## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.968 White Random Forest Model
wrandfor_roccurvewrandfor_confusionmatrixOverall the random forest for the white wine data set has a drastic improvement in terms of the accuracy and roc auc. The variance importance plot reveals that alcohol, volatile acidity and density are most important variables in the data. Every single ROC AUC curve is almost perfect, although as we have learned through the course, this model may be over fitting, and as a result, may not perform as well on the testing data set. This model did the best for predicting wine qualities of 5,6, and 7.
Finally, let’s look at the boosted tree model.
wboost_spec <- boost_tree(tree_depth = 5) %>%
set_engine("xgboost") %>%
set_mode("classification")wboost_wf <- workflow() %>%
add_model(wboost_spec %>%
set_args(trees = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)param_grid3 <- grid_regular(trees(range = c(10,2000)), levels = 10)
wtune_res_boosted <- tune_grid(
wboost_wf,
resamples = white_fold,
grid = param_grid3
)This plot below visualizes the accuracy and roc auc levels for the number of trees. Since we tune the value for the number of trees, we can choose to use the number of trees which has the highest accuracy and roc auc levels.
wBoostedAutoPlotwbest_rocauc2 <- collect_metrics(wtune_res_boosted) %>% arrange(desc(mean))
wbest_metric2 <- select_best(wtune_res_boosted)
print(wbest_metric2)wboost_final <- boost_tree(tree_depth = 5, trees = 231)%>%
set_engine("xgboost") %>%
set_mode("classification")wboost_fit_final <- fit(wboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)# augmenting
wpredicted <- augment(wboost_fit_final, new_data = white_train)
wboosted_acc <- wpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
wboosted_rocauc <- wpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
wBoostedROCCurve <- wpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wBoostedConfusionMatrix <- wpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")wboosted_acc## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 1 White Boosted Trees Model
wboosted_rocauc## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 1 White Boosted Trees Model
wBoostedROCCurvewBoostedConfusionMatrixThe accuracy and roc auc for the boosted trees model on the white wine dataset is 1, meaning every single observation was correctly classified in the model. This model has the potential to run amazingly on the testing data, although there is undeniable evidence of overfitting.
wbest_roc_table <- rbind(wbest_rocauc[1,c(2,4)], wbest_rocauc1[1,c(4,6)], wbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
wbest_roc_table## # A tibble: 3 × 3
## .metric mean model_type
## <chr> <dbl> <chr>
## 1 roc_auc 0.688 Decision Tree
## 2 roc_auc 0.794 Random Forest
## 3 roc_auc 0.805 Boosted Trees
# We will test our model using Boosted Trees and Random Forest As we can see, between the three tree based methods we ran, the two with the highest roc auc were Boosted Trees and Random Forest. LDA using PCA had a higher accuracy than Decision Trees so we will use that as our third model.
To conclude, we will test our data on three models: LDA using PCA, Random Forest and Boosted Trees.
Testing the PCA LDA model on the data:
pcalda_fit_test <- augment(fit_final, new_data = white_test)
pcalda_test_acc <- pcalda_fit_test %>% accuracy(truth = quality, estimate = .pred_class)
pcalda_test_rocauc <- pcalda_fit_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA")
pcalda_test_roccurve <- pcalda_fit_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
pcalda_test_confusionmatrix <- pcalda_fit_test %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")pcalda_test_acc## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.514
pcalda_test_rocauc## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.685 White Wine LDA Model using PCA
pcalda_test_roccurvepcalda_test_confusionmatrixTesting the Random Forest model on the data:
wrandfor_pred_test <- augment(wrandfor_fit_final, new_data = white_test)
wrandfor_acc_test <- wrandfor_pred_test %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
wrandfor_rocauc_test <- wrandfor_pred_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
wrandfor_roccurve_test <- wrandfor_pred_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wrandfor_confusionmatrix_test <- augment(wrandfor_fit_final, new_data = white_test) %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")wrandfor_acc_test## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.608 White Random Forest Model
wrandfor_rocauc_test## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.764 White Random Forest Model
wrandfor_roccurve_testwrandfor_confusionmatrix_testTesting the Boosted Trees model on the data:
wpredictedtest <- augment(wboost_fit_final, new_data = white_test)
wboosted_acc_test <- wpredictedtest %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
wboosted_rocauc_test <- wpredictedtest %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
wBoostedROCCurveTesting <- wpredictedtest %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wBoostedConfusionMatrixTesting <- wpredictedtest %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")wboosted_acc_test## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.657 White Boosted Trees Model
wboosted_rocauc_test## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.821 White Boosted Trees Model
wBoostedROCCurveTestingwBoostedConfusionMatrixTestingIn conclusion, the Boosted Trees model did the best on the testing data for the White Wine data set with an accuracy of 65.7% and a roc auc level of 0.82. The Random Forest model is a close second with accuracy 60.8% and a roc auc value of 0.76. LDA did not work very well, and using principal component analysis did not have much of an improvement. We thought that it would due to the fact that many of our predictors had high correlation. Majority of our models do well in predicting values of 5 and 6. Although, in general, we have seen that all of the models thus far do not do as well with 3,4, and 8. This is probably due to the fact that there are not as many observations with this quality level.
Next, let’s see how the models perform on the red wine data set. We will be using the same type of models that we used in white wine in order to keep things consistent.
red_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())#lda model using cross validation
rlda_model <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
rlda_wkflow<- workflow() %>%
add_model(rlda_model) %>%
add_recipe(red_recipe)
rlda_fit_cross <- fit_resamples(rlda_wkflow, red_fold)
collect_metrics(rlda_fit_cross)## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.601 5 0.0127 Preprocessor1_Model1
## 2 roc_auc hand_till 0.786 5 0.00468 Preprocessor1_Model1
#naive bayes model using cross validation
rnb_mod <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("klaR") %>%
set_args(usekernel = FALSE)
rnb_wkflow <- workflow() %>%
add_model(rnb_mod) %>%
add_recipe(red_recipe)
rnb_fit_cross <- fit_resamples(rnb_wkflow, red_fold)
collect_metrics(rnb_fit_cross)## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy multiclass 0.534 5 0.00828 Preprocessor1_Model1
## 2 roc_auc hand_till 0.743 5 0.00714 Preprocessor1_Model1
Through k-fold cross validation, we can see that the linear discriminant analysis model produces a more accurate model than the Naive Bayes model. Thus, we will also use linear discriminant analysis on the red wine data set.
red_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(red_recipe_pca$template))# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)rlda_pca_wkflow <- workflow() %>%
add_model(rlda_model) %>%
add_recipe(red_recipe_pca)
rpca_lda_fit <- rlda_pca_wkflow %>%
tune_grid(resamples = red_fold,
grid = tuneGrid,
metrics = metric_set(accuracy),
control = trControl)
rpca_lda_metrics <- rpca_lda_fit %>% collect_metrics()ggplot(rpca_lda_metrics, aes(x = num_comp, y = mean)) +
geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
scale_x_continuous(breaks = 1:ncol(red_recipe_pca$template)) +
facet_wrap(~.metric) +
theme_bw()rpca_lda_fit %>% show_best(metric = "accuracy")
(bestTune <- rpca_lda_fit %>% select_by_one_std_err(num_comp, metric = "accuracy"))
rlda_pca_wkflow_final <- rlda_pca_wkflow %>% finalize_workflow(bestTune)
rfit_final <- rlda_pca_wkflow_final %>% fit(red_train)red.PCALDA <- tibble(red_train,
predict(rfit_final, new_data =red_train, type = "class"), # predicted class
predict(rfit_final, new_data = red_train, type = "prob"), # posterior prob. for classes
as_tibble(predict(rfit_final, new_data = red_train, type = "raw")$x)) # LD scores# plot
ggplot(red.PCALDA, aes(x = LD1, y = LD2)) +
geom_point(aes(color = quality, shape = .pred_class)) +
theme_bw() +
ggtitle("PCA-LDA (DAPC) on Red Wine Training dataset, using 9 PC")rpcalda_fit <- augment(rfit_final, new_data = red_train)
rpcalda_acc <- rpcalda_fit %>% accuracy(truth = quality, estimate = .pred_class)
rpcalda_rocauc <- rpcalda_fit %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Wine LDA Model using PCA ")
rpcalda_roccurve <- rpcalda_fit %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rpcalda_confusionmatrix <- rpcalda_fit %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")rpcalda_acc## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.618
rpcalda_rocauc## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.826 "White Wine LDA Model using PCA "
rpcalda_roccurverpcalda_confusionmatrix# decision tree specification
rtree_spec <- decision_tree() %>%
set_engine("rpart")
# setting mode to classification
rtree_spec_class <- rtree_spec %>%
set_mode("classification")rclass_tree_fit <- rtree_spec_class %>%
fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)rclass_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()# augmented on training
augment(rclass_tree_fit, new_data = red_train) %>%
accuracy(truth = quality, estimate = .pred_class)
augment(rclass_tree_fit, new_data = red_train) %>%
conf_mat(truth = quality, estimate = .pred_class)# tuning cost complexity
rclass_tree_wf<- workflow() %>%
add_model(rtree_spec_class %>%
set_args(cost_complexity = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)
tune_res_red <- tune_grid(
rclass_tree_wf,
resamples = red_fold,
grid = param_grid,
metric = metric_set(accuracy)
)autoplot(tune_res_red)# extracting the best cost complexity parameter
best_complexity <- select_best(tune_res_red)
rclass_tree_final <- finalize_workflow(rclass_tree_wf, best_complexity)
rclass_tree_final_fit <- fit(rclass_tree_final, data = red_train)rclass_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()# augmented on training
augment(rclass_tree_final_fit, new_data = red_train) %>%
accuracy(truth = quality, estimate = .pred_class)
augment(rclass_tree_final_fit, new_data = red_train) %>%
conf_mat(truth = quality, estimate = .pred_class)# augmented on testing
augment(rclass_tree_final_fit, new_data = red_test) %>%
conf_mat(truth = quality, estimate = .pred_class)
augment(rclass_tree_final_fit, new_data = red_test) %>%
accuracy(truth = quality, estimate = .pred_class)rrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")rrandfor_wf <- workflow() %>%
add_model(rrandfor %>%
set_args(mtry = tune(), trees = tune(), min_n = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)# tuning the model to find the best arguments
param_grid2 <- grid_regular(mtry(range = c(1,9)), trees(range = c(15,17)), min_n(range = c(30,50)), levels = 8)
rtune_res_randfor <- tune_grid(
rrandfor_wf,
resamples = red_fold,
grid = param_grid2
)rAutoPlotRF# collecting metrics to find best mean
rbest_rocauc1 <- collect_metrics(rtune_res_randfor) %>% arrange(desc(mean))
print(rbest_rocauc1)
rbest_metric1 <- select_best(rtune_res_randfor)
rrandfor_final <- rand_forest(mtry = 7, trees = 17, min_n = 32) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
rrandfor_fit_final <- fit(rrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol,data = red_train)rVIP <- vip(rrandfor_fit_final)
rVIPrboost_spec <- boost_tree(tree_depth = 5) %>%
set_engine("xgboost") %>%
set_mode("classification")rboost_wf <- workflow() %>%
add_model(rboost_spec %>%
set_args(trees = tune())) %>%
add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)param_grid4 <- grid_regular(trees(range = c(10,2000)), levels = 10)
rtune_res_boosted <- tune_grid(
rboost_wf,
resamples = red_fold,
grid = param_grid4
)rBoostedAutoPlotrbest_rocauc2 <- collect_metrics(rtune_res_boosted) %>% arrange(desc(mean))
print(rbest_rocauc2)
rbest_metric2 <- select_best(rtune_res_boosted)
print(rbest_metric2)
rboost_final <- boost_tree(tree_depth = 5, trees = 231)%>%
set_engine("xgboost") %>%
set_mode("classification")
rboost_fit_final <- fit(rboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)
rpredicted <- augment(rboost_fit_final, new_data = red_train)
rboosted_acc <- rpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc <- rpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurve <- rpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrix <- rpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")rbest_rocauc
rbest_rocauc1
rbest_rocauc2
rbest_roc_table <- rbind(rbest_rocauc[1,c(2,4)], rbest_rocauc1[1,c(4,6)], rbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
rbest_roc_tablerpcalda_fit_test <- augment(rfit_final, new_data = red_test)
rpcalda_test_acc <- rpcalda_fit_test %>% accuracy(truth = quality, estimate = .pred_class)
rpcalda_test_rocauc <- rpcalda_fit_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "Red Wine LDA Model using PCA")
rpcalda_test_roccurve <- rpcalda_fit_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rpcalda_test_confusionmatrix <- rpcalda_fit_test %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")rpcalda_test_acc## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.561
rpcalda_test_rocauc## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.778 Red Wine LDA Model using PCA
rpcalda_test_roccurverpcalda_test_confusionmatrixrrandfor_pred_test <- augment(wrandfor_fit_final, new_data = white_test)
rrandfor_acc_test <- wrandfor_pred_test %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Random Forest Model")
rrandfor_rocauc_test <- wrandfor_pred_test %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Random Forest Model")
rrandfor_roccurve_test <- wrandfor_pred_test %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rrandfor_confusionmatrix_test <- augment(wrandfor_fit_final, new_data = white_test) %>%
conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")rrandfor_pred_test## # A tibble: 481 × 20
## fixed…¹ volat…² citri…³ resid…⁴ chlor…⁵ free.…⁶ total…⁷ density pH sulph…⁸
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.8 0.76 0.04 2.3 0.092 15 54 0.997 3.26 0.65
## 2 5.6 0.615 0 1.6 0.089 16 59 0.994 3.58 0.52
## 3 8.9 0.62 0.19 3.9 0.17 51 148 0.999 3.17 0.93
## 4 8.1 0.56 0.28 1.7 0.368 16 56 0.997 3.11 1.28
## 5 7.9 0.32 0.51 1.8 0.341 17 56 0.997 3.04 1.08
## 6 8.9 0.22 0.48 1.8 0.077 29 60 0.997 3.39 0.53
## 7 7.6 0.39 0.31 2.3 0.082 23 71 0.998 3.52 0.65
## 8 8.5 0.49 0.11 2.3 0.084 9 67 0.997 3.17 0.53
## 9 7.6 0.41 0.24 1.8 0.08 4 11 0.996 3.28 0.59
## 10 6.9 0.685 0 2.5 0.105 22 37 0.997 3.46 0.57
## # … with 471 more rows, 10 more variables: alcohol <dbl>, quality <fct>,
## # type <chr>, .pred_class <fct>, .pred_3 <dbl>, .pred_4 <dbl>, .pred_5 <dbl>,
## # .pred_6 <dbl>, .pred_7 <dbl>, .pred_8 <dbl>, and abbreviated variable names
## # ¹fixed.acidity, ²volatile.acidity, ³citric.acid, ⁴residual.sugar,
## # ⁵chlorides, ⁶free.sulfur.dioxide, ⁷total.sulfur.dioxide, ⁸sulphates
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
rrandfor_acc_test## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.632 Red Random Forest Model
rrandfor_rocauc_test## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.792 White Random Forest Model
rrandfor_roccurve_testrrandfor_confusionmatrix_testrpredictedtest <- augment(wboost_fit_final, new_data = white_test)
rboosted_acc_test <- wpredictedtest %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc_test <- wpredictedtest %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurveTesting <- wpredictedtest %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrixTesting <- wpredictedtest %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")rpredictedtest## # A tibble: 481 × 20
## fixed…¹ volat…² citri…³ resid…⁴ chlor…⁵ free.…⁶ total…⁷ density pH sulph…⁸
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.8 0.76 0.04 2.3 0.092 15 54 0.997 3.26 0.65
## 2 5.6 0.615 0 1.6 0.089 16 59 0.994 3.58 0.52
## 3 8.9 0.62 0.19 3.9 0.17 51 148 0.999 3.17 0.93
## 4 8.1 0.56 0.28 1.7 0.368 16 56 0.997 3.11 1.28
## 5 7.9 0.32 0.51 1.8 0.341 17 56 0.997 3.04 1.08
## 6 8.9 0.22 0.48 1.8 0.077 29 60 0.997 3.39 0.53
## 7 7.6 0.39 0.31 2.3 0.082 23 71 0.998 3.52 0.65
## 8 8.5 0.49 0.11 2.3 0.084 9 67 0.997 3.17 0.53
## 9 7.6 0.41 0.24 1.8 0.08 4 11 0.996 3.28 0.59
## 10 6.9 0.685 0 2.5 0.105 22 37 0.997 3.46 0.57
## # … with 471 more rows, 10 more variables: alcohol <dbl>, quality <fct>,
## # type <chr>, .pred_class <fct>, .pred_3 <dbl>, .pred_4 <dbl>, .pred_5 <dbl>,
## # .pred_6 <dbl>, .pred_7 <dbl>, .pred_8 <dbl>, and abbreviated variable names
## # ¹fixed.acidity, ²volatile.acidity, ³citric.acid, ⁴residual.sugar,
## # ⁵chlorides, ⁶free.sulfur.dioxide, ⁷total.sulfur.dioxide, ⁸sulphates
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
rboosted_acc_test## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.674 White Boosted Trees Model
rboosted_rocauc_test## # A tibble: 1 × 4
## .metric .estimator .estimate model_type
## <chr> <chr> <dbl> <chr>
## 1 roc_auc hand_till 0.774 White Boosted Trees Model
rBoostedROCCurveTestingrBoostedConfusionMatrixTestingred conclusion